-
Notifications
You must be signed in to change notification settings - Fork 708
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Workflow output definition #1227
base: dev
Are you sure you want to change the base?
Conversation
Signed-off-by: Ben Sherman <[email protected]>
|
Signed-off-by: Ben Sherman <[email protected]>
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Liking this. If we really need the output block (rather than doing something with emit), this is a nice readable way of doing it. |
This is beginning to look great. All the publishing logic is in one location, easy to review and understand where it's coming from. There are two downsides to this approach:
|
Agreeing with Adam, it's a bit too implicit, especially what is a path what is a topic |
In the Nextflow PR there are some docs which explain the feature in more detail. Unfortunately the deploy preview isn't working so you'll have to look at the diff
Indeed this is the downside of selecting channels instead of processes. More flexible but more layers of indirection. We should be able to alleviate this with IDE tooling, i.e. hover over a selected channel to see it's definition
Thanks @pinin4fjords , I never responded to your idea about putting everything in the The main question now is, how to bring the outputs for PREPARE_GENOME and RNASEQ up to the top-level workflow? I was thinking some kind of |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
The current prototype simply maps the output channels to the publish directory structure, but we still need to get these outputs to the top level whereas currently they are nested under Before I go off and add a gajillion channels to the @adamrtalbot @pinin4fjords @maxulysse @ewels Since you guys understand this pipeline better than me, I'm wondering, how would you group all of these outputs if you could group them any way you want? You are no longer restricted to process selectors or directory names, but you could use those if you wanted. For example, I see the modules config for RNASEQ is grouped with these comments:
Would those be good top-level groupings for outputs? Then you might have topics called |
I managed to move everything to the top-level workflow, so it should be executable now (though there are likely some bugs, will test tomorrow). I ended up using topics for everything, using the various publish directories to guide the topic names. Hope this gives you a more concrete sense of how topics are useful. The topics don't really reduce the amount of code, they just split it between the output DSL and the workflow |
As Evan mentioned on Slack, this does seem very verbose:
But I understand why, since if even one of the outputs from a process needs to go to a different topic then you can't use the multi-channel object Rather than doing this from the calling workflow, could e.g. QUANTIFY_STAR_SALMON use a topic as part of its emit, to 'suggest' a classification for that channel?
Then, if we all used good standards (e.g. an ontology for topics for outputs), calling workflows could have very minimal logic for this, relying on what the components said about their outputs. The calling workflow would only need to decide what to do with the topics in its outputs. |
We can definitely move some of these topic mappings into the modules and subworkflows, that was going to be my next step. I also suspect that nf-core will be able to converge on a shared ontology for these things. I'd still rather keep the topic mapping separate from the emits though, as we will need the |
I moved most of the topic mappings into their respective subworkflows. It gets tricky when a workflow is used multiple times under a different name and with different publish behavior. For example, I can't set a "sensible default" in the subworkflow because I can't override the default later, I can only specify additional topics. Or I could specify a default and not use it in the output definition for rnaseq, instead re-mapping each alias to different topics as I am currently doing. However, keeping the topic mappings in the RNASEQ workflow is also tricky because the process/workflow might not be executed, in which case the topic mapping will fail. We might need to replicate the control flow in the topic:
if{ !params.skip_alignment && params.aligner == 'star_rsem' ) {
DESEQ2_QC_RSEM.out.rdata >> 'align-deseq2'
DESEQ2_QC_RSEM.out.pca_txt >> 'align-deseq2'
DESEQ2_QC_RSEM.out.pdf >> 'align-deseq2'
DESEQ2_QC_RSEM.out.dists_txt >> 'align-deseq2'
DESEQ2_QC_RSEM.out.size_factors >> 'align-deseq2'
DESEQ2_QC_RSEM.out.log >> 'align-deseq2'
} Totally doable, but unfortunate if we have to resort to it @adamrtalbot noted in Slack that most Nextflow pipelines don't come close to this level of complexity, so I wouldn't be opposed to moving forward with what we have and let the rnaseq maintainers sort out the details. Though we do need to address the last point about conditional topic mappings |
I'm loving the principle:
The multi import thing didn't occur to me. Could we use a variable sent in via the |
In this case I would manipulate the channel to what I wanted. If I had to use a topic I would use them at the last second. So again, as long as topics are optional I think everything can be handled reasonably well.
Presumably, if a
Even better, just tidy up the channels before making the topic:
I think my overall impression is topics are a nice sugar on top of existing channels, in which case most of the key logic should be in the channel manipulations. Topics are a way of turning a local channel into a global one and should do very little else.
That sounds like a bug 😆 |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Notes on the latest update:
Overall, everything is much more concise and more in line with what many people have suggested, to simply annotate the workflows with the publish paths. The output definition is no longer a comprehensive view of all outputs, but there is a degree of modularity, and you can be verbose in the output definition if you want to. @adamrtalbot thanks for your comments, makes me feel more confident about the prototype. I think all of the remaining TODOs can be addressed by refactoring some dataflow logic, it can be handled by the rnaseq devs. |
Really liking the way this is going now, it's going to be very tidy. Would it be feasible at some point to use some optional dynamism in the modules, to facilitate repeated usage?
|
Maybe in a future iteration. But related to this, we are interested in building on the concept of the samplesheet as a way to hold metadata for file collections in general, and it might be a better practice than trying to encode metadata in the filenames. For example Paolo has proposed that we have a method in the output DSL to automatically publish an index file for a given "rule": output {
directory 'results'
'star_salmon/intermeds/' {
index 'index.csv'
}
}
Of course you could also do this manually like in fetchngs, and I would like to add a stdlib function like |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
The redirect to It seems like the best delineation for what goes in the top-level publish block vs the workflow publish sections is, the workflows define what is published (including conditional logic) while the top-level publish def should define how things are being published (mode, whether to overwrite, content type, tags, etc). This is also good for modularity. Note that some subworkflows are now using params which is an anti-pattern. For this I recommend passing those params as workflow inputs to keep things modular. |
We have been trying to eliminate that when we see it |
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like how it simplifies things quite a bit.
My main issues with this is that the publish path is spread across so many files, which probably means we shouldn't be using this in nf-core in any of the resuable/modular parts, especially if we can't turn off anything without having params
for everything.
Paths that need meta data like sample name are not here, but I guess they're directly accessible in the process
?
@@ -62,6 +62,16 @@ workflow ALIGN_STAR { | |||
BAM_SORT_STATS_SAMTOOLS ( ch_orig_bam, fasta ) | |||
ch_versions = ch_versions.mix(BAM_SORT_STATS_SAMTOOLS.out.versions) | |||
|
|||
publish: | |||
ch_orig_bam >> (params.save_align_intermeds || params.save_umi_intermeds ? 'star_salmon/' : null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is fine for specific pipelines, but makes module reusability across pipelines more cumbersome. If we need to have ternary operators in every nf-core module to control whether an output is published, this would make the pipeline schema, potentially huge.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree, I cut some corners here by not passing the params as workflow inputs. In the final implementation I would do that so that you can choose whether to expose it as a param in your own pipeline
workflows/rnaseq/main.nf
Outdated
@@ -782,6 +782,127 @@ workflow RNASEQ { | |||
ch_multiqc_report = MULTIQC.out.report | |||
} | |||
|
|||
publish: | |||
QUANTIFY_STAR_SALMON.out.results >> 'star_salmon/' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is going to be annoying to users/developers not knowing where a publish path is defined when trying to debug why their file is being written to another location.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The main principle is that callers overwrite callees. But also I think we will extend the inspect
command to also show the resolved publish targets for the entire pipeline, so that it is more clear.
Signed-off-by: Ben Sherman <[email protected]>
Signed-off-by: Ben Sherman <[email protected]>
publish: | ||
ch_genome >> (params.save_reference ? 'genome' : null) | ||
ch_genome_index >> (params.save_reference ? 'genome/index' : null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's one way to shorten the publish section -- gather related channels for each publish target in the main body
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was going to suggest this for align_star
, quantify_rsem
etc, think that's the right thing to do.
But it does feel a bit like we're doing something publishing-specific in the main body of the workflow.
Could we do like:
publish:
Channel.empty().mix(
ch_splicesites,
ch_bbsplit_index,
ch_star_index,
ch_rsem_index,
ch_hisat2_index,
ch_salmon_index,
ch_kallisto_index,
) >> (params.save_reference ? 'genome/index' : null)
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes that also works
workflow { | ||
main: | ||
def ch_bam = Channel.empty() | ||
def ch_fasta = Channel.empty() | ||
def ch_fai = Channel.empty() | ||
|
||
BAM_MARKDUPLICATES_PICARD( ch_bam, ch_fasta, ch_fai ) | ||
|
||
publish: | ||
BAM_MARKDUPLICATES_PICARD.out.bam >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.metrics >> 'picard/metrics' | ||
BAM_MARKDUPLICATES_PICARD.out.bai >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.csi >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.stats >> 'picard/samtools_stats' | ||
BAM_MARKDUPLICATES_PICARD.out.flagstat >> 'picard/samtools_stats' | ||
BAM_MARKDUPLICATES_PICARD.out.idxstats >> 'picard/samtools_stats' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of how to provide example publish targets for a module/subworkflow while keeping them in an entry workflow. This way the module/subworkflow can be invoked directly (similar to nf-wrap) and could even supplement/replace the nf-test DSL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been heavily in favour of this since I first heard it. Every component becomes a runnable thing.
QUANTIFY_STAR_SALMON.out >> 'star_salmon' | ||
QUANTIFY_STAR_SALMON.out.multiqc >> null | ||
QUANTIFY_STAR_SALMON.out.merged_counts_transcript >> null | ||
QUANTIFY_STAR_SALMON.out.merged_tpm_transcript >> null | ||
QUANTIFY_STAR_SALMON.out.versions >> null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here's an example of publishing all outputs of a module/subworkflow and disabling the ones you don't want
// TODO: !params.skip_alignment && params.aligner == 'star_rsem' | ||
// DESEQ2_QC_RSEM.out >> 'star_rsem/deseq2_qc' | ||
// DESEQ2_QC_RSEM.out.pca_multiqc >> null | ||
// DESEQ2_QC_RSEM.out.dists_multiqc >> null | ||
// DESEQ2_QC_RSEM.out.versions >> null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Already discussed this before, but channels like these need to be prepared in the main body if the process is conditionally executed, so that the process .out
is only accessed in the conditional block, for example:
main:
def deseq2_qc_rsem = Channel.empty()
if( !params.skip_alignment && params.aligner == 'star_rsem' ) {
DESEQ2_QC_RSEM( /* ... */ )
deseq2_qc_rsem = Channel.empty().mix(
DESEQ2_QC_RSEM.out.pdf,
DESEQ2_QC_RSEM.out.rdata,
// ...
)
}
publish:
deseq2_qc_rsem >> 'star_rsem/deseq2_qc'
Just updated this PR based on the second preview (nextflow-io/nextflow#5185) Publish sections have been simplified and consolidated somewhat. I feel more confident that we could get to a single publish section in RNASEQ and PREPARE_GENOME, so that e.g. nf-core modules/subworkflows aren't using params or publishers. It will require some refactoring the channels around workflow-based publishing instead of process-based publishing. Left some comments to this effect, pointing out some common examples |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Really loving this now
publish: | ||
ch_genome >> (params.save_reference ? 'genome' : null) | ||
ch_genome_index >> (params.save_reference ? 'genome/index' : null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was going to suggest this for align_star
, quantify_rsem
etc, think that's the right thing to do.
But it does feel a bit like we're doing something publishing-specific in the main body of the workflow.
Could we do like:
publish:
Channel.empty().mix(
ch_splicesites,
ch_bbsplit_index,
ch_star_index,
ch_rsem_index,
ch_hisat2_index,
ch_salmon_index,
ch_kallisto_index,
) >> (params.save_reference ? 'genome/index' : null)
?
workflow { | ||
main: | ||
def ch_bam = Channel.empty() | ||
def ch_fasta = Channel.empty() | ||
def ch_fai = Channel.empty() | ||
|
||
BAM_MARKDUPLICATES_PICARD( ch_bam, ch_fasta, ch_fai ) | ||
|
||
publish: | ||
BAM_MARKDUPLICATES_PICARD.out.bam >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.metrics >> 'picard/metrics' | ||
BAM_MARKDUPLICATES_PICARD.out.bai >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.csi >> 'picard' | ||
BAM_MARKDUPLICATES_PICARD.out.stats >> 'picard/samtools_stats' | ||
BAM_MARKDUPLICATES_PICARD.out.flagstat >> 'picard/samtools_stats' | ||
BAM_MARKDUPLICATES_PICARD.out.idxstats >> 'picard/samtools_stats' | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've been heavily in favour of this since I first heard it. Every component becomes a runnable thing.
publish: | ||
BAM_RSEQC.out.bamstat_txt >> 'rseqc/bam_stat' | ||
BAM_RSEQC.out.inferexperiment_txt >> 'rseqc/infer_experiment' | ||
BAM_RSEQC.out.junctionannotation_pdf >> 'rseqc/junction_annotation' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, we definitely need the .mix()
I think
This PR adds a workflow output definition based on nextflow-io/nextflow#4784. I'm still working through the pipeline, but once I'm done, I will have completely replaced publishDir using the output DSL.
See also nf-core/fetchngs#275 for ongoing discussion